Weighted load balancing policy #1922
base: 4.x
Conversation
Additionally, this will tackle JAVA-3040
I'd put configuration documentation in …
 *
 * The default weights are good for the vast majority of use cases, but you can tweak them to get different behavior.
 */
public class RackAwareWeightedLoadBalancingPolicy extends DefaultLoadBalancingPolicy {
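As context for the configuration discussion, here is a minimal sketch of how a policy class like this could be enabled programmatically in the 4.x driver. The class name comes from this PR; the contact point, datacenter name, and the class's final package are assumptions.

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.config.DefaultDriverOption;
import com.datastax.oss.driver.api.core.config.DriverConfigLoader;
import java.net.InetSocketAddress;

public class RackAwareExample {
  public static void main(String[] args) {
    // Point the standard load-balancing-policy option at the new class.
    // The package is omitted here because it may change before merge; use the
    // fully-qualified name from the merged PR in real code.
    DriverConfigLoader loader =
        DriverConfigLoader.programmaticBuilder()
            .withString(
                DefaultDriverOption.LOAD_BALANCING_POLICY_CLASS,
                "RackAwareWeightedLoadBalancingPolicy")
            .build();

    try (CqlSession session =
        CqlSession.builder()
            .addContactPoint(new InetSocketAddress("127.0.0.1", 9042)) // assumed contact point
            .withLocalDatacenter("dc1") // required, since the policy extends DefaultLoadBalancingPolicy
            .withConfigLoader(loader)
            .build()) {
      System.out.println("Connected as " + session.getName());
    }
  }
}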
It looks like this is going to be most useful in environments like AWS, where Cassandra racks correspond to AWS AZs, there are relatively few racks, and there's a high likelihood that an application has some instances running in the same rack as Cassandra. In environments where racks correspond to actual racks in a datacenter, there are many more racks and the likelihood of an application running in the same rack is low, so there's not as much benefit to prioritizing rack alignment as heavily.
Does that sound right based on your experience?
It would be useful to have a metric for the count of requests that go to each category (rack-aligned, replica-aligned, instance-starting, instance-unhealthy), so users can know whether they're setting the local rack correctly, have set weights correctly, etc.
Yes, it's certainly more valuable for cases where the rack-to-rack latency is higher. However, being rack-aware is just one part of it; I think the mechanism of preferring nodes with fewer in-flight requests is an improvement over the round-robin mechanism (assuming non-token-aware queries, or all replicas) in DefaultLoadBalancingPolicy/BasicLoadBalancingPolicy. Additionally, the getAndUpdate in those balancers ends up being a bottleneck in high-throughput situations. It's been a while since our ApacheCon talk and there's no recording, but the slides are available at https://www.apachecon.com/acna2022/slides/04-Khaku-Lynch_Improving-Cassandra-Client.pdf and go into the problems a bit.
My eventual goal is to make this the default load balancing policy in the driver, but for now I'll settle for getting it in and making it an option.
Regarding the metrics - that's perhaps a little tricky, since we're creating a query plan here rather than firing off requests. Additionally, the scoring/ordering is decoupled from the characteristics that resulted in the score - perhaps some trace logging with the characteristics? In practice, we saw latencies drop immediately when we deployed this.
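One lightweight way to surface those characteristics, alongside trace logging, would be per-category counters bumped at scoring time. The sketch below is purely illustrative; the category names and the idea of hooking this into the scoring path are assumptions, not part of this PR.

import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical helper: counts how often each scoring category applies, so operators
// can verify that the local rack and the weights are configured sensibly.
final class ScoreCategoryCounters {
  enum Category { RACK_ALIGNED, REPLICA_ALIGNED, STARTING, UNHEALTHY, OTHER }

  private final Map<Category, LongAdder> counters = new EnumMap<>(Category.class);

  ScoreCategoryCounters() {
    for (Category c : Category.values()) {
      counters.put(c, new LongAdder());
    }
  }

  // Would be called from the scoring path (an assumption; not how the PR is structured today).
  void record(Category category) {
    counters.get(category).increment();
  }

  long count(Category category) {
    return counters.get(category).sum();
  }
}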
Yeah, I think the changes outside of rack-awareness make this worth recommending broadly. I wouldn't support making it the default until it's been available for at least a release, but worth pushing it in that direction.
// By default we will only score this many nodes, the rest will get added on without scoring.
// We don't usually need to score every single node if there are more than a few.
static final int DEFAULT_SCORED_PLAN_SIZE = 8;
It seems like it would make sense to increase this value if using a large cluster and lots of token-unaware queries (ALLOW FILTERING, WHERE TOKEN, WHERE pk IN), since replica-local does not apply but rack-local does, and in a larger cluster the likelihood of an 8-node random sample including a rack-local Cassandra host is lower than some might like. Increasing the scored plan size will increase the likelihood that a rack-local host is used out of the sample.
Does that sound right to you? Any other situations to tune this based on your experience?
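To put a rough number on the sampling concern above, here is a small back-of-the-envelope sketch. The cluster shape (200 nodes in the local DC across 20 racks, so 10 rack-local nodes) is an assumed example, not anything from this PR.

import java.math.BigInteger;

// Rough illustration: in a large DC, how likely is it that a random sample of
// DEFAULT_SCORED_PLAN_SIZE nodes contains at least one rack-local node?
public final class RackLocalSampleOdds {

  // Binomial coefficient C(n, k); the division is exact at every step because the
  // running product of i+1 consecutive terms is always divisible by (i+1)!.
  static BigInteger choose(int n, int k) {
    BigInteger result = BigInteger.ONE;
    for (int i = 0; i < k; i++) {
      result = result.multiply(BigInteger.valueOf(n - i)).divide(BigInteger.valueOf(i + 1));
    }
    return result;
  }

  public static void main(String[] args) {
    int dcNodes = 200;   // nodes in the local DC (assumed)
    int rackLocal = 10;  // nodes in the client's rack, one of 20 racks (assumed)
    int planSize = 8;    // DEFAULT_SCORED_PLAN_SIZE

    // P(no rack-local node in the sample) = C(190, 8) / C(200, 8)
    double pNone = choose(dcNodes - rackLocal, planSize).doubleValue()
        / choose(dcNodes, planSize).doubleValue();
    // Prints roughly 0.34 for this cluster shape, which is why a larger scored plan size can help.
    System.out.printf("P(at least one rack-local node among %d sampled) = %.3f%n",
        planSize, 1.0 - pNone);
  }
}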
That's fair; this default was chosen to fit our requirements, but a larger default may make sense for a more general use case.
It's a reasonable default - just thinking through what we should document for tuning.
private final double weightNonRack;
private final double weightNonReplica;
private final double weightStarting;
private final double weightUnhealthy;
It looks like the error-aware LBP was lost in the 4.x rewrite: https://github.com/apache/cassandra-java-driver/blob/3.x/driver-core/src/main/java/com/datastax/driver/core/policies/ErrorAwarePolicy.java
Not requesting any change here - just realized that this doesn't exist anymore. It could be useful to incorporate those ideas in the future.
}

protected double getWeightedScore(Node node, Session session, long nowMillis, long nowNanos, boolean isReplica) {
  int base = Math.min(32768, 1 + getInFlight(node, session));
getInFlight calls InFlightHandler#getInFlight, which isn't thread-safe. StreamIdGenerator#maxAvailableIds is a plain int, so it could be cached on a calling thread. The risk of this value being inaccurate is low; we could update maxAvailableIds to be volatile, but either way it's not a huge concern.
maxAvailableIds is a final instance variable on StreamIdGenerator, so I think we're safe here, unless I misunderstand how that works. And availableIds() returns the value of an AtomicInteger, so I think that means InFlightHandler#getInFlight is thread-safe.
You're right - it's final, missed that.
}

// Take a copy of nodes and reference to replicas since the node map is concurrent
Set<Node> dcNodeSet = getLiveNodes().dc(getLocalDatacenter());
If a user sets a local DC, the current implementation only plans and scores for Cassandra hosts in the local DC. If a user doesn't set a local DC, there's no weighting to encourage the query plan to select a coordinator in the local DC over one in a remote DC, right? If there's no local DC configured, getLocalDatacenter is empty, and dcNodeSet ends up being all nodes.
If a user uses all non-local consistency levels (QUORUM, etc.) and doesn't want all traffic to go to a coordinator in the local DC but does want to encourage the query planner to pick coordinators in the local DC, how would they configure that?
Currently RackAwareWeightedLoadBalancingPolicy extends DefaultLoadBalancingPolicy, which uses MandatoryLocalDcHelper and so requires a local DC to be set. I can, however, see an argument for it instead extending BasicLoadBalancingPolicy, in a world where RAWLBP is a more general solution with sensible defaults (kind of like UCS is for Cassandra compaction?) and you can tweak weights to get what you want.
Regarding non-local consistency levels - if we allow scoring non-local-DC nodes, I think we'd still want to weight the local ones higher and rely on the server-side snitch to figure out a quorum partner if needed.
This PR open-sources the work Joey Lynch and I did and presented at ApacheCon 2022. It requires a bit of work before it's mergeable, but I'm hoping to get feedback on a couple of areas before I put in much more work; a couple of major things are still left to do.
The weighted load balancing policy uses a number of factors as weights, along with the number of in-flight requests for each host, to score and then sort the list of known live hosts.
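As a rough illustration of the scoring described above, here is a minimal sketch. The multiplicative combination, the example weight values, and the NodeInfo holder are assumptions inferred from this description and the quoted diff, not the exact implementation in the PR.

import java.util.Comparator;
import java.util.List;

// Minimal sketch of weighted scoring: start from the in-flight request count and
// penalize nodes that miss desirable properties; a lower score means more preferred.
final class WeightedScoringSketch {

  static final class NodeInfo {
    final int inFlight;
    final boolean sameRack;
    final boolean isReplica;
    final boolean isStarting;
    final boolean isUnhealthy;

    NodeInfo(int inFlight, boolean sameRack, boolean isReplica,
             boolean isStarting, boolean isUnhealthy) {
      this.inFlight = inFlight;
      this.sameRack = sameRack;
      this.isReplica = isReplica;
      this.isStarting = isStarting;
      this.isUnhealthy = isUnhealthy;
    }
  }

  // Example weights only; the PR's defaults may differ.
  static final double WEIGHT_NON_RACK = 4.0;
  static final double WEIGHT_NON_REPLICA = 2.0;
  static final double WEIGHT_STARTING = 16.0;
  static final double WEIGHT_UNHEALTHY = 64.0;

  static double score(NodeInfo node) {
    // Cap the base so one very busy node can't dominate the scale, mirroring the
    // Math.min(32768, 1 + getInFlight(...)) pattern quoted in the diff above.
    double score = Math.min(32768, 1 + node.inFlight);
    if (!node.sameRack) score *= WEIGHT_NON_RACK;
    if (!node.isReplica) score *= WEIGHT_NON_REPLICA;
    if (node.isStarting) score *= WEIGHT_STARTING;
    if (node.isUnhealthy) score *= WEIGHT_UNHEALTHY;
    return score;
  }

  static void sortByScore(List<NodeInfo> candidates) {
    candidates.sort(Comparator.comparingDouble(WeightedScoringSketch::score));
  }
}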